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Abstract 

Relational  learning  is  concerned  with  predicting  unknown  values  of  a  relation,  given  a  database  of  entities 
and  observed  relations  among  entities.  An  example  of  relational  learning  is  movie  rating  prediction,  where 
entities  could  include  users,  movies,  genres,  and  actors.  Relations  would  then  encode  users’  ratings  of  movies, 
movies’  genres,  and  actors’  roles  in  movies.  A  common  prediction  technique  given  one  pairwise  relation,  for 
example  a  fusers  x  ^movies  ratings  matrix,  is  low-rank  matrix  factorization.  In  domains  with  multiple 
relations,  represented  as  multiple  matrices,  we  may  improve  predictive  accuracy  by  exploiting  information 
from  one  relation  while  predicting  another.  To  this  end,  we  propose  a  collective  matrix  factorization  model: 
we  simultaneously  factor  several  matrices,  sharing  parameters  among  factors  when  an  entity  participates  in 
multiple  relations.  Each  relation  can  have  a  different  value  type  and  error  distribution;  so,  we  allow  nonlinear 
relationships  between  the  parameters  and  outputs,  using  Bregman  divergences  to  measure  error.  We  extend 
standard  alternating  projection  algorithms  to  our  model,  and  derive  an  efhcient  Newton  update  for  the 
projection.  Furthermore,  we  propose  stochastic  optimization  methods  to  deal  with  large,  sparse  matrices. 
Our  model  generalizes  several  existing  matrix  factorization  methods,  and  therefore  yields  new  large-scale 
optimization  algorithms  for  these  problems.  Our  model  can  handle  any  pairwise  relational  schema  and  a 
wide  variety  of  error  models.  We  demonstrate  its  efficiency,  as  well  as  the  benefit  of  sharing  parameters 
among  relations. 
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1  Introduction 


Relational  data  consists  of  entities  and  relations  between  them.  In  many  cases,  such  as  relational  databases, 
the  number  of  entity  types  and  relation  types  are  fixed.  Two  important  tasks  in  such  domains  are  link 
prediction,  determining  whether  a  relation  exists  between  two  entities,  and  link  regression,  determining  the 
value  of  a  relation  between  two  entities  given  that  the  relation  exists. 

Many  relational  domains  involve  only  one  or  two  entity  types:  documents  and  words;  users  and  items; 
or  academic  papers  where  attributed  links  between  entities  represent  counts,  ratings,  or  citations.  In  such 
domains,  we  can  represent  the  links  as  an  m  x  n  matrix  X:  rows  of  X  correspond  to  entities  of  one  type, 
columns  of  X  correspond  to  entities  of  the  other  type,  and  the  element  Xij  indicates  either  whether  a 
relation  exists  between  entities  i  and  j.  A  low-rank  factorization  of  X  has  the  form  X  «  f{UV"^),  with 
factors  U  S  and  V  £  Here  fe  >  0  is  the  rank,  and  /  is  a  possibly- nonlinear  link  funetion. 

Different  choices  of  /  and  different  definitions  of  w  lead  to  different  models:  minimizing  squared  error  with 
an  identity  link  yields  the  singular  value  decomposition  (corresponding  to  a  Gaussian  error  model),  while 
other  choices  extend  generalized  linear  models  [28]  to  matrices  [14,  17]  and  lead  to  error  models  such  as 
Poisson,  Gamma,  or  Bernoulli  distributions. 

In  domains  with  more  than  one  relation  matrix,  one  could  fit  each  relation  separately;  however,  this 
approach  would  not  take  advantage  of  any  correlations  between  relations.  For  example,  a  domain  with  users, 
movies,  and  genres  might  have  two  relations:  an  integer  matrix  representing  users’  ratings  of  movies  on  a 
scale  of  1-5,  and  a  binary  matrix  representing  the  genres  each  movie  belongs  to.  If  users  tend  to  rate  dramas 
higher  than  comedies,  we  would  like  to  exploit  this  correlation  to  improve  prediction. 

To  do  so,  we  extend  generalized  linear  models  to  arbitrary  relational  domains.  We  factor  each  relation 
matrix  with  a  generalized-linear  link  function,  but  whenever  an  entity  type  is  involved  in  more  than  one 
relationship,  we  tie  factors  of  different  models  together.  We  refer  to  this  approach  as  collective  matrix 
factorization. 

We  demonstrate  that  a  general  approach  to  collective  matrix  factorization  can  work  efficiently  on  large, 
sparse  data  sets  with  relational  schemas  and  nonlinear  link  functions.  Moreover,  we  show  that,  when  relations 
are  correlated,  collective  matrix  factorization  can  achieve  higher  prediction  accuracy  than  factoring  each 
matrix  separately.  Our  code  is  available  under  an  open  license.^ 


2  A  Unified  View  of  Factorization 

The  building  block  of  collective  factorization  is  single-matrix  factorization,  which  models  a  single  relation 
between  two  entity  types  £i  and  £2-  If  there  are  m  entities  of  type  £i  and  n  of  type  £2,  we  write  X  G  R™xn 
for  our  matrix  of  observations,  and  U  £  y  g  low-rank  factors.  A  factorization 

algorithm  can  be  defined  by  the  following  choices,  which  are  sufficient  to  include  most  existing  approaches 
(see  Sec.  2.2  for  examples): 

1.  Prediction  link  /  :  R™xn  ^ 

2.  Loss  function  'D{X,  f{UV'^))  >  0,  a  measure  of  the  error  in  predicting  f{UV'^)  when  the  answer  is  X. 

3.  Optional  data  weights  W  £  R™^",  which  if  used  must  be  an  argument  of  the  loss. 

4.  Hard  constraints  on  factors,  {U,  V)  £  C. 

5.  Regularization  penalty,  TZ{U,  V)  >  0. 

For  the  model  X  f(UV'^),  we  solve: 

(1)  argmin  [V{X,  f{UV'^))  +  n{U,  V)]. 

{U,V)&C 

^Source  code  is  available  at  http://www.cs.cinu.edu/~ajit/cmf.  This  paper  is  a  longer  version  of  Singh  et  al.  [33]. 
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The  loss  !?(•,  •)  quantifies  «  in  the  model.  It  is  typically  convex  in  its  second  argument,  and  often  decomposes 
into  a  weighted  sum  over  the  elements  of  X.  For  example,  the  loss  for  weighted  SVD  [34]  is 

Vw{X,  UV^)  =  \\WQ{X-  uv^)\\l,,, 

where  0  denotes  the  element-wise  product  of  matrices. 

Prediction  links  /  allow  nonlinear  relationships  between  UV'^  and  the  data  X.  The  choices  of  /  and 
2?  are  closely  related  to  distributional  assumptions  on  X;  see  Section  2.1.  Common  regularizers  for  linear 
models,  such  as  .^p-norms,  are  easily  adapted  to  matrix  factorization.  Other  regularizers  have  been  proposed 
specifically  for  factorization;  for  example,  the  trace  norm  of  ,  the  sum  of  its  singular  values,  has  been 
proposed  as  a  continuous  proxy  for  rank  [35].  For  clarity,  we  treat  hard  constraints  C  separately  from 
regularizers.  Examples  of  hard  constraints  include  orthogonality;  stochasticity  of  rows,  columns,  or  blocks 
(e.^.,  in  some  formulations  of  matrix  co-clustering  each  row  of  U  and  V  sums  to  1);  non-negativity;  and 
sparsity  or  cardinality. 

2.1  Bregman  Divergences 

A  large  class  of  matrix  factorization  algorithms  restrict  V  to  generalized  Bregman  divergences:  e.g.^  singular 
value  decomposition  [16]  and  non-negative  matrix  factorization  [21]. 

Definition  1  ([17]).  For  a  closed,  proper,  convex  funetion  F  :  ^  the  generalized  Bregman  diver¬ 

gence  between  matrices  Z  and  Y  is 


]D>f(^||E)  =  F{Z)+F*{Y)-Y  oZ, 

where  Ao  B  is  the  matrix  dot  product  B)  =  AijBij  and  F*  is  the  eonvex  dual  (Fenchel-Legendre 
conjugate):  F*{p.)  =  supggdomF  [(^:  m)  -  -F(6»)]. 

If  F*  is  differentiable,  this  is  equivalent  to  the  standard  definition  [10,  11],  except  that  the  standard 
definition  uses  arguments  Z  and  \/F*{Y)  instead  of  Z  and  Y.  If  F  decomposes  into  a  sum  over  components 
of  Z,  we  can  define  a  weighted  divergence.  Overloading  F  to  denote  a  single  component  of  the  sum, 

nF{Z\\Y,  W)=Y,  W,,  {F{Z,,)  +  F*{Y,,)  -  Y,,Z,,) . 
ij 

Examples  include  weighted  versions  of  squared  loss,  F{x)  =  ,  and  I-divergence,  F{x)  =  a:  log  a;  —  x.  Our 

primary  focus  is  on  decomposable  regular  Bregman  divergences  [6] ,  which  correspond  to  maximum  likelihood 
in  exponential  families: 

Definition  2.  A  parametric  family  of  distributions  ifp  =  {pFix\0)  :  0}  is  a  regular  exponential  family  if 
each  density  has  the  form 

logpF{x\9)  =  logpo(a;)  +  -  F{9) 

where  0  is  the  vector  of  natural  parameters  for  the  distribution,  x  is  the  vector  of  minimal  sufficient  statistics, 
and  F{0)  is  the  log-partition  function 

F{0)  =  log  J po{x)  ■  exp{9^x)  dx. 

A  distribution  in  ip f  is  uniquely  identified  by  its  natural  parameters.  For  regular  exponential  families 

logpFix\9)  =  logpo{x)  -\-F*{x)  -]D>F*(a:||  /(6»)) 

where  the  matching  prediction  link  is  f{0)  =  XF{0)  [15,  4,  14,  6].  Minimizing  a  Bregman  divergence  under 
a  matching  link  is  equivalent  to  maximum  likelihood  for  the  corresponding  exponential  family  distribution. 
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The  relationship  between  matrix  factorization  and  exponential  families  is  seen  by  treating  the  data  matrix 
X  as  a  collection  of  samples,  X  =  {Xu, . . . ,  Xmn}-  Modeling  X  =  we  have  that  Xij  is  drawn  from 

the  distribution  in  tjjp  with  natural  parameter  {UV'^)ij. 

Decomposable  losses,  which  can  be  expressed  as  the  sum  of  losses  over  elements,  follows  from  matrix 
exchangeability  [2,  3].  A  matrix  X  is  row-and-column  exchangeable  if  permuting  the  rows  and  columns  of  X 
does  not  change  the  distribution  of  X .  For  example,  if  Ai  is  a  document-word  matrix  of  counts,  the  relative 
position  of  two  documents  in  the  matrix  is  unimportant:  the  rows  are  exchangeable  (likewise  for  words).  A 
surprising  consequence  of  matrix  exchangeability  is  that  the  distribution  of  X  can  be  described  by  a  function 
of  a  global  matrix  mean,  row  and  column  effects  {e.g.,  row  biases,  column  biases),  and  a  per-element  effect 
{e.g.,  the  natural  parameters  above).  The  per-element  effect  leads  naturally  to  decomposable  losses.  An 
example  where  decomposability  is  not  a  legitimate  assumption  is  when  one  dimension  indexes  a  time- varying 
quantity. 

2.2  Examples 

The  simplest  case  of  matrix  factorization  is  the  singular  value  decomposition:  the  data  weights  are  constant, 
the  prediction  link  is  the  identity  function,  the  divergence  is  the  sum  of  squared  errors,  and  the  factors  are 
unregularized.  A  hard  constraint  that  one  factor  is  orthogonal  and  the  other  orthonormal  ensures  uniqueness 
of  the  global  optimum  (up  to  permutations  and  sign  changes),  which  can  be  found  using  Gaussian  elimination 
or  the  Power  method  [16]. 

Variations  of  matrix  factorization  change  one  or  more  of  the  above  choices.  Non-negative  matrix  factor¬ 
ization  [21]  maximizes  the  objective 

(2)  X  o  log(17V^)  -  1  o  UV^ 

where  1  is  a  matrix  with  all  elements  equal  to  1.  Maximizing  Equation  2  is  equivalent  to  minimizing  the 
I-divergence  ID_f/(X  jj  log([/V^))  under  the  constraints  U,V  >Q.  Here  H{x)  =  a;log(a;)  —  x.  The  prediction 
link  is  f{6)  =  log(0). 

The  scope  of  matrix  factorizations  we  consider  is  broader  than  [17],  but  the  same  alternating  Newton- 
projections  approach  (see  Sections  4-5)  can  be  generalized  to  all  the  following  scenarios,  as  well  as  to  collective 
matrix  factorization:  (i)  constraints  on  the  factors,  which  are  not  typically  considered  in  Bregman  matrix 
factorization  as  the  resulting  loss  is  no  longer  a  regular  Bregman  divergence.  Constraints  allow  us  to  place 
methods  like  non- negative  matrix  factorization  [21]  or  matrix  co-clustering  into  our  framework,  (ii)  non- 
Bregman  matrix  factorizations,  such  as  max-margin  matrix  factorization  [32],  which  can  immediately  take 
advantage  of  the  large  scale  optimization  techniques  in  Sections  4-5;  (iii)  row  and  column  biases,  where  a 
column  of  U  is  paired  with  a  fixed,  constant  column  in  V  (and  vice-versa).  If  the  prediction  link  and  loss 
correspond  to  a  Bernoulli  distribution,  then  margin  losses  are  special  cases  of  biases;  (iv)  methods  based 
on  plate  models,  such  as  pLSI  [19],  can  be  placed  in  our  framework  just  as  well  as  methods  that  factor 
data  matrices.  While  these  features  can  be  added  to  collective  matrix  factorization,  we  focus  primarily  on 
relational  issues  herein. 


3  Relational  schemas 

A  relational  schema  contains  t  entity  types,  £i  . .  .£t-  There  are  rii  entities  of  type  i,  denoted  A 

relation  between  two  types  is  Si  Sj]  index  m  £  N  allows  us  to  distinguish  multiple  relations  between  the 
same  types,  and  is  omitted  when  no  ambiguity  results.  In  this  paper,  we  only  consider  binary  relations.  The 
matrix  for  Si  Sj  has  rii  rows,  nj  columns,  and  is  denoted  jf  have  not  observed  the  values 

of  all  possible  relations,  we  fill  in  unobserved  entries  with  0  (so  that  jg  ^  sparse  matrix),  and  assign 

them  zero  weight  when  learning  parameters.  By  convention,  we  assume  i  <  j.  Without  loss  of  generality,  we 
assume  that  it  is  possible  to  traverse  links  from  any  entity  type  to  any  other;  if  not,  we  can  fit  each  connected 
component  in  the  schema  separately.  This  corresponds  to  a  fully  connected  entity-relationship  model  [12]. 
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We  fit  each  relation  matrix  as  the  product  of  latent  factors,  where  C/^®^  6 

jjU)  g  -^UjXkij  fgj,  g  {1,2,...}.  Unless  otherwise  noted,  the  prediction  link  is  an 
element-wise  function  on  matrices.  If  £j  participates  in  more  than  one  relation,  we  allow  our  model  to  use 
only  a  subset  of  the  columns  of  for  each  one.  This  flexibility  allows  us,  for  example,  to  have  relations  with 
different  latent  dimensions,  or  to  have  more  than  one  relation  between  £i  and  £j  without  forcing  ourselves 
to  predict  the  same  value  for  each  one.  In  an  implementation,  we  would  store  a  list  of  participating  column 
indices  from  each  factor  for  each  relation;  but  to  avoid  clutter,  we  ignore  this  possibility  in  our  notation. 


4  Collective  Factorization 

For  concision,  we  introduce  collective  matrix  factorization  on  the  three-entity- type  schema  £i  ^  £2  ^  £3,  and 
use  simplified  notation:  the  two  data  matrices  are  X  =  and  Y  =  X^^^\  of  dimensions  m  =  ni,  n  =  n2, 
and  r  =  ns-  The  factors  are  U  =  V  =  U^'^\  and  Z  =  The  latent  dimension  is  /c  =  k\2  =  ^23- 

The  weight  matrix  for  X  is  W ,  and  the  weight  matrix  for  Y  is  W .  Since  £2  participates  in  both  relations, 
we  use  the  factor  V  in  both  reconstructions:  X  «  fi{UV'^)  and  Y  w  f2{VZ’^). 

An  example  of  this  schema  is  collaborative  filtering:  £i  are  users,  £2  are  movies,  and  £^  are  genres.  X  is 
a  matrix  of  observed  ratings,  and  Y  indicates  which  genres  a  movie  belongs  to  (each  column  corresponds  to 
a  genre,  and  movies  can  belong  to  multiple  genres). 

One  model  of  Bregman  matrix  factorization  [17]  proposes  the  following  decomposable  loss  function  for 

x^hiuv^y. 


Li{U,  V\W)  =  IDfi  (UV'^WX,  W)  +  Dg(0  ||  C/)  -k  ©//(O  ||  U), 

where  G{u)  =  \v?  12  and  H{v)  =  7f^/2  for  A,7  >  0  corresponds  to  (.2  regularization.  Ignoring  terms  that 
do  not  vary  with  the  factors  the  loss^  is 

Li(C/,  V\W)  =  {Wo  F{UV^)  -{WqX)o  UV^)  +  G*{U)  +  H*{V). 

Similarly,  if  Y  were  factored  alone,  the  loss  would  be 

L2{V,  Z\W)  =  \\Y,W)+  D^^(0  II  U)  -k  Bi{Q\\Z). 

Since  U  is  a  shared  factor  we  average  the  losses: 

(3)  L{U,V,Z\W,W)  =  aLi{U,V\W)  +  {I  -  a)L2{V,Z\W), 

where  a  G  [0, 1]  weights  the  relative  importance  of  relations. 

Each  term  in  the  loss,  Li  and  L2,  is  decomposable  and  twice-differentiable,  which  is  all  that  is  required 
for  the  alternating  projections  technique  described  in  Section  4.1.  Despite  the  simplicity  of  Equation  3, 
it  has  some  interesting  implications.  The  distribution  of  given  and  and  the  distribution  of 

(21  (3l  •  ...  (21 

Yjfe  given  Xj  ®  and  a;),  ® ,  need  not  agree  on  the  marginal  distribution  of  Xj  ' .  Extending  the  notion  of  row- 

(21 

column  exchangeability,  each  entity  x^  ®  corresponds  to  a  record  whose  features  are  the  possible  relations 
with  entities  of  types  £i  and  £3.  Let  X2.1  denote  the  features  corresponding  to  relations  involving  entities  of 
ifi,  and  J-2,3  the  features  corresponding  to  relations  involving  entities  of  £3.  If  the  features  are  binary,  they 
indicate  whether  or  not  an  entity  participates  in  a  relation  with  x^  ' .  The  latent  representation  of  x^  '  is 
Vj.,  where  UV^  and  Vj.Z'^  determines  the  distribution  over  J-2.1  and  J-2^3  respectively. 

^The  conference  version  of  the  paper  [33]  contains  an  erroneous  definition  of  L\(U,  V®|VF),  which  is  corrected  here. 
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4.1  Parameter  Estimation 


Equation  3  is  convex  in  any  one  of  its  arguments.  We  extend  the  alternating  projection  algorithm  for  matrix 
factorization,  fixing  all  but  one  argument  oi  C  =  L{U,V,Z\W,W)  and  updating  the  free  factor  using  a 
Newton-Raphson  step.  Differentiating  the  loss  with  respect  to  each  factor: 

(4)  VuC  =  a{WQ  -X))V  +  VG*{U), 

VvC  =  a{WQ  (h{UV^)  -  X))^  U+ 

(5) 

(1  -a)(wQ  {f2iVZ^)  -Y)')Z  + 

(6)  VzC  =  {l-a)(wQ  {f2{VZ^)  -  1/  +  vr{z). 

Setting  the  gradients  equal  to  zero  yields  update  equations  for  U,  V,  and  Z.  Note  that  the  gradient  step 
does  not  require  the  divergence  to  be  decomposable,  nor  does  it  require  that  that  the  matching  losses 
be  differentiable;  simply  replace  gradients  with  subgradients  in  the  prequel.  For  £2  regularization  on  U, 
G{U)  =  A||C/|p/2,  VG*{U)  =  U/X.  The  gradient  for  a  factor  is  a  linear  combination  of  the  gradients  with 
respect  to  the  individual  matrix  reconstructions  the  factor  participates  in. 

A  cursory  inspection  of  Equations  4-6  suggests  that  an  Newton  step  is  infeasible.  The  Hessian  with 
respect  to  U  would  involve  nk  parameters.  However,  if  Li  and  L2  are  each  decomposable  functions,  then  we 
can  show  that  almost  all  the  second  derivatives  of  C  with  respect  to  a  single  factor  U  are  zero.  Moreover, 
the  Newton  update  for  the  factors  reduces  to  row-wise  optimization  of  U,  V,  and  Z.  For  the  subclass  of 
models  where  Equations  4-6  are  differentiable  and  the  loss  is  decomposable,  define 

qiU,.)  =  a  {W,.  ©  {h{U,.V^)  -  X,.))  R  +  VG*{U,.), 
g(W)  =  a  {W.i  0  (/i(C/F,^)  -  X.,)f  U+ 

(1  -  a)  (Wi.  0  {f2{Va^)  -  W))  ^ 

q{z,.)  =  (1  -  a)  (w.i  ©  {f2{vzl)  -  V  +  vr  (Z,.). 

Since  all  but  one  factor  is  fixed,  consider  the  derivatives  of  q{Ui.)  with  respect  to  any  scalar  parameter  in  U : 
'X7ujs<l{Ui-)-  Because  Ujs  only  appears  in  q{Ui.)  when  j  =  i,  the  derivative  equals  zero  when  j  yf  i.  Therefore 
the  Hessian  is  block-diagonal,  where  each  non-zero  block  corresponds  to  a  row  of  U.  The  inverse  of  a 
block-diagonal  matrix  is  the  inverse  of  each  block,  and  so  the  Newton  direction  for  U,  can 

be  reduced  to  updating  each  row  Ui.  using  the  direction  [q{Ui.)][q' {Ui.)]~^ .  The  above  argument  applies  to 
V  and  Z  as  well,  since  the  loss  is  a  sum  of  per-matrix  losses  and  the  derivative  is  a  linear  operator. 

Any  (local)  optima  of  the  loss  C  corresponds  to  roots  of  the  equations  1,  and 

{q{Zi.)}\^-y.  We  derive  the  Newton  step  for  Ui., 

(7)  Ur-  =  U..-r,-q{U..)[q'{U..)]-\ 

where  we  suggest  using  the  Armijo  criterion  [30]  to  set  rj.  To  concisely  describe  the  Hessian  we  introduce 
terms  for  the  contribution  of  the  regularizer. 


G,  =  diag(V2G*(G,.)),  H,  =  dmg{X^  H*  {V,.)) ,  h  =  di&g{X^r{Z,.)), 


and  terms  for  the  contribution  of  the  reconstruction  error, 

=  diag(W,.  0  f[{U,.V^)),  D2,^  =  diag(W.i  © 

D3.,  =  diag(W,.  ©  f^{V,^Z)),  D4,.  =  diag(W.i  ©  f'2{VZl)). 
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The  Hessians  with  respect  to  the  loss  £  are 


q\U,.)  =  Vq{U,.)  =  aV^Di^.V  +  G, 

q'{Z,.)  =  VqiZ^.)  =  (1  -  a)V^Di^rV  +  h 

q'{v,.)  =  Vq{V,.)  =  aU^D2,^U  +  (1  -  a)Z^D^^a  +  H, 


Each  update  of  U,  V,  and  Z  reduces  at  least  one  term  in  Equation  3.  Iteratively  cycling  through  the  update 
leads  to  a  local  optima.  In  practice,  we  simplify  the  update  by  taking  one  Newton  step  instead  of  running 
to  convergence. 

4.2  Weights 

In  addition  to  weighing  the  importance  of  reconstructing  different  parts  of  a  matrix,  W  and  W  serve  other 
purposes.  First,  the  data  weights  can  be  used  to  turn  the  objective  into  a  per-element  loss  by  scaling  each 
element  of  X  by  (rim)~^  and  each  element  of  E  by  (rir)~^ .  This  ensures  that  larger  matrices  do  not  dominate 
the  model  simply  because  they  are  larger.  Second,  weights  can  be  used  to  correct  for  differences  in  the  scale  of 
Li(C/,  V)  and  £2(1^  Z)-  If  the  Bregman  divergences  are  regular,  we  can  use  the  corresponding  log-likelihoods 
as  a  consistent  scale.  If  the  Bregman  divergences  are  not  regular,  computing 

Dp,  {UV^  1 1 X,  W)/Df,  {VZ^  1 1  r,  IE) , 

averaged  over  uniform  random  parameters  U ,  V,  and  E,  provides  an  adequate  estimate  of  the  relative  scale 
of  the  two  losses.  A  third  use  of  data  weights  is  missing  values.  If  the  value  of  a  relation  is  unobserved,  the 
corresponding  weight  is  set  to  zero. 

4.3  Generalizing  to  Arbitrary  Schemas 

The  three-factor  model  generalizes  to  any  pairwise  relational  schema,  where  binary  relations  are  represented 
as  a  set  of  edges:  E  =  {(£  j)  :  £i  £j  A  i  <  j}.  Let  [U]  denote  the  set  of  latent  factors  and  [IE]  the  weight 
matrices.  The  loss  of  the  model  is 

L{[U]  I  [IE])  =  j]  lE^*^')))  +  ^  [  ^  )  Dg«(0  ]] 

{i,3)&E  i=l  \j:ii,j)GE  J 

where  Fbt)  defines  the  loss  for  a  particular  reconstruction,  and  defines  the  loss  for  a  regularizer.  The 
relative  weights  >  q  measure  the  importance  of  each  matrix  in  the  reconstruction.  Since  the  loss  is  a 
linear  function  of  individual  losses,  and  the  differential  operator  is  linear,  both  gradient  and  Newton  updates 
can  be  derived  in  a  manner  analogous  to  Section  4.1,  taking  care  to  distinguish  when  acts  as  a  column 
factor  as  opposed  to  a  row  factor. 

5  Stochastic  Approximation 

In  optimizing  a  collective  factorization  model,  we  are  in  the  unusual  situation  that  our  primary  concern  is 
not  the  cost  of  computing  the  Hessian,  but  rather  the  cost  of  computing  the  gradient  itself:  if  k  is  the  largest 
embedding  dimension,  then  the  cost  of  a  gradient  update  for  a  row  is  0{kJ2j-£^r^Sj  '^3)1  Hi®  cost  of 

a  Newton  update  for  the  same  row  is  0{k^  +  'Yhj-e-r^E-  Typically  k  is  much  smaller  than  the  number 
of  entities,  and  so  the  Newton  update  costs  only  a  factor  of  k  more.  (The  above  calculations  assume  dense 
matrices;  for  sparsely-observed  relations,  we  can  replace  Uj  by  the  number  of  entities  of  type  £j  which  are 
related  to  entity  x^\  but  the  conclusion  remains  the  same.) 
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The  expensive  part  of  the  gradient  calculation  for  is  to  compute  the  predicted  value  for  each  observed 
relation  that  entity  Xr  participates  in,  so  that  we  can  sum  all  of  the  weighted  prediction  errors.  One 
approach  to  reducing  this  cost  is  to  compute  errors  only  on  a  subset  of  observed  relations,  picked  randomly 
at  each  iteration.  This  technique  is  known  as  stochastic  approximation  [7].  The  best-known  stochastic 
approximation  algorithm  is  stochastic  gradient  descent;  but,  since  inverting  the  Hessian  is  not  a  significant 
part  of  our  computational  cost,  we  will  recommend  a  stochastic  Newton’s  method  instead. 

Consider  the  update  for  Ui.  in  the  three  factor  model.  This  update  can  be  viewed  as  a  regression  where 
the  data  are  Xi.  and  the  features  are  the  columns  of  V.  If  we  denote  a  sample  of  the  data  as  s  C  {1, . . . ,  n}, 
then  the  sample  gradient  at  iteration  r  is 

qriu,.)  =  a  (W,,  ©  (/(C/,.I/J)  -  X,s))  14.  +  VG*(C/,.), 

Similarly,  given  subsets  p  Q  {1, . . . ,  n}  and  q  C  {1, . . . ,  r},  the  sample  gradients  for  the  other  factors  are 

<7,(14.)  =  a  {Wpi  ©  {fiUp.V,^)  -  Xp,)f  Up.+ 

(1  -  a)  (w,g  ©  ifiv^.z^)  -  y,,))  z,.  +  Vi7*(y,), 

g,(Z,)  =  (1  -  a)  (Wsi  ©  ifiVs.Zl)  -  H,.  +  Vr  (Z,.). 

The  stochastic  gradient  update  for  U  at  iteration  r  is 

Ul+^=Ul-T-^riU..). 

and  similarly  for  the  other  factors.  Note  that  we  use  a  hxed,  decaying  sequence  of  learning  rates  instead  of  a 
line  search:  sample  estimates  of  the  gradient  are  not  always  descent  directions.  An  added  advantage  of  the 
fixed  schedule  over  line  search  is  that  the  latter  is  computationally  expensive. 

We  sample  data  non-uniformly,  without  replacement,  from  the  distribution  induced  by  the  data  weights. 
That  is,  for  a  row  Ui.,  the  probability  of  drawing  Xij  is  ^ij-  This  sampling  distribution  provides  a 

compelling  relational  interpretation:  to  update  the  latent  factors  of  Xr'^ ,  we  sample  only  observed  relations 
involving  Xr  ■  For  example,  to  update  a  user’s  latent  factors,  we  sample  only  movies  that  the  user  rated.  We 
use  a  separate  sample  for  each  row  of  U :  this  way,  errors  are  independent  from  row  to  row,  and  their  effects 
tend  to  cancel.  In  practice,  this  means  that  our  actual  training  loss  decreases  at  almost  every  iteration. 

With  sampling,  the  cost  of  the  gradient  update  no  longer  grows  linearly  in  the  number  of  entities  related 
to  ,  but  only  in  the  number  of  entities  sampled.  Another  advantage  of  this  approach  is  that  when  we 
sample  one  entity  at  a  time,  |s|  =  \p\  =  |(7|  =  1,  stochastic  gradient  yields  an  online  algorithm,  which  need 
not  store  all  the  data  in  memory. 

As  mentioned  above,  we  can  often  improve  the  rate  of  convergence  by  moving  from  stochastic  gradient 
descent  to  stochastic  Newton- Raphson  updates  [7,  8].  For  the  three- factor  model  the  stochastic  Hessians  are 

g;(c/,)  =  +  G„ 

q',{Z,.)  =  {\  -  a)V^ +  h, 

q'^iy^.)  =  aU^yyp.  +  (1  -  yzlyy^.  +  H,. 

where 

bi^i  =  diag(H4s  ©  /((C/j.yJ)),  £>2,*  =  <lia,g{Wpi  ©  f[{Up.Vb)), 

ba.i  =  diag{Wiq  ©  Zq.)),  =  diag(tysi  ©  fbVs-Zf.)). 

To  satisfy  convergence  conditions,  which  will  be  discussed  in  Section  5.1,  we  use  an  exponentially  weighted 
moving  average  of  the  Hessian: 

(8)  qr+i{^  =  {l  -  qr{-)  +  :^^q'r+i{-) 
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When  the  sample  at  each  step  is  small  compared  to  the  embedding  dimension,  the  Sherman-Morrison- 
Woodbury  lemma  {e.g.,  [7])  can  be  used  for  efficiency.  The  stochastic  Newton  update  is  analogous  to 
Equation  7,  except  that  77  =  l/r,  the  gradient  is  replaced  by  its  sample  estimate  q,  and  the  Hessian  is 
replaced  by  its  sample  estimate  q. 

5.1  Convergence 

We  consider  three  properties  of  stochastic  Newton,  which  together  are  sufficient  conditions  for  convergence 
to  a  local  optimum  of  the  empirical  loss  C  [8].  These  conditions  are  also  satisfied  by  setting  the  Hessian  to 
the  identity,  q{-)  =  Ik  —  be.,  stochastic  gradient. 

Local  Convexity:  The  loss  must  be  locally  convex  around  its  minimum,  which  must  be  contained  in  its  domain. 
In  alternating  projections  the  loss  is  convex  for  any  Bregman  divergence;  and,  for  regular  divergences,  has 
M  as  its  domain.  The  non-regular  divergences  we  consider,  such  as  Hinge  loss,  also  satisfy  this  property. 
Uniformly  Bounded  Hessian:  The  eigenvalues  of  the  sample  Hessians  are  bounded  in  some  interval  [— c,  c] 
with  probability  1.  This  condition  is  satisfied  by  testing  whether  the  condition  number  of  the  sample 
Hessian  is  below  a  large  fixed  value,  be.,  the  Hessian  is  invertible.  Using  the  £2  regularize!'  always  yields  an 
instantaneous  Hessian  q  that  is  full  rank.  The  eigenvalue  condition  implies  that  the  elements  of  q  and  its 
inverse  are  uniformly  bounded. 

Convergence  of  the  Hessian:  There  are  two  choices  of  convergence  criteria  for  the  Hessian.  Either  one  suffices 
for  proving  convergence  of  stochastic  Newton,  (i)  The  sequence  of  inverses  of  the  sample  Hessian  converges 
in  probability  to  the  true  Hessian:  lim,-^oo(gT-)“^  =  {q')~^.  Alternately,  (ii)  the  perturbation  of  the  sample 
Hessian  from  its  mean  is  bounded.  Let  Vr-i  consist  of  the  history  of  the  stochastic  Newton  iterations:  the 
data  samples  and  the  parameters  for  the  first  t  —  1  iterations.  Let  gr  =  Ogifr)  denote  an  almost  uniformly 
bounded  stochastic  order  of  magnitude.  The  stochastic  o-notation  is  similar  to  regular  o-notation,  except 
that  we  are  allowed  to  ignore  measure-zero  events  and  E[os{fr)]  =  fr-  The  alternate  convergence  criteria  is 
a  concentration  of  measure  statement: 

ElqrlVr-l]  =qr+  Os{1/t). 

For  Equation  8  this  condition  is  easy  to  verify: 

IT’t-i]  ~  ~  Qt-1  + 

since  Vt-i  contains  qr-i-  Any  perturbation  from  the  mean  is  due  to  the  second  term.  If  q  is  invertible  then 
its  elements  are  uniformly  bounded,  and  so  are  the  elements  of  since  this  term  has  bounded 

elements  and  is  scaled  by  2/r,  the  perturbation  is  Os(1/t).  One  may  fold  in  an  instantaneous  Hessian  that 
is  not  invertible,  so  long  as  the  moving  average  q  remains  invertible.  The  above  proves  the  convergence  of  a 
factor  to  the  value  which  minimizes  the  expected  loss,  assuming  the  other  factors  are  fixed.  With  respect  to 
the  alternating  projection,  we  only  have  convergence  to  a  local  optima  of  the  empirical  loss  C. 

6  Related  Work 

Collective  matrix  factorization  provides  a  unified  view  of  matrix  factorization  for  relational  data:  different 
methods  correspond  to  different  distributional  assumptions  on  individual  matrices,  different  schemas  tying 
factors  together,  and  different  optimization  procedures.  We  distinguish  our  work  from  prior  methods  on 
three  points:  (i)  competing  methods  often  impose  a  clustering  constraint,  whereas  we  cover  both  cluster  and 
factor  analysis  (although  our  experiments  focus  on  factor  analysis);  (ii)  our  stochastic  Newton  method  lets 
us  handle  large,  sparsely  observed  relations  by  taking  advantage  of  decomposability  of  the  loss;  and  (iii)  our 
presentation  is  more  general,  covering  a  wider  variety  of  models,  schemas,  and  losses.  In  particular,  for  (iii), 
our  model  emphasizes  that  there  is  little  difference  between  factoring  two  matrices  versus  three  or  more; 
and,  our  optimization  procedure  can  use  any  twice  differentiable  decomposable  loss,  including  the  important 


class  of  Bregman  divergences.  For  example,  if  we  restrict  our  model  to  a  single  relation  £i  £2,  we  can 
recover  all  of  the  single-matrix  models  mentioned  in  Sec.  2.2.  While  our  alternating  projections  approach  is 
conceptually  simple,  and  allows  one  to  take  advantage  of  decomposability,  there  is  a  panoply  of  alternatives 
for  factoring  a  single  matrix.  The  more  popular  ones  includes  majorization  [22] ,  which  iteratively  minimize 
a  sequence  of  convex  upper  bounding  functions  tangent  to  the  objective,  including  the  multiplicative  update 
for  NMF  [21]  and  the  EM  algorithm,  which  is  used  both  for  pLSI  [19]  and  weighted  SVD  [34].  Direct 
optimization  solves  the  non-convex  problem  with  respect  to  (U,  V)  using  gradient  or  second-order  methods, 
such  as  the  fast  variant  of  max-margin  matrix  factorization  [32]. 

The  next  level  of  generality  is  a  three-entity- type  model  ~  £2  ~  ^^3-  A  well-known  example  of 
such  a  schema  is  pLSTpHITS  [13],  which  models  document-word  counts  and  document-document  citations: 
£i  =  words  and  £2  =  £3  =  documents,  but  it  is  trivial  to  allow  £2  ^  £3-  Given  relations  £i  ~  £2  and  £2  ^  £3, 
with  corresponding  integer  relationship  matrices  and  the  likelihood  is 

(9)  C  =  o  log  (UV^)  -f  (1  -  o  log  {VZ^)  , 

where  the  parameters  C/,  V,  and  Z  correspond  to  probabilities  Uik  =  \  /ifc),  Vik  =  p{hk  \  and 

Zik  =  p(.x^t^  1  hk)  for  clusters  {hi, . . .  ^hx}-  Probability  constraints  require  that  each  column  of  U,  V'^ , 
and  Z  must  sum  to  one,  which  induces  a  clustering  of  entities.  Since  different  entities  can  participate  in 
different  numbers  of  relations  {e.g.,  some  words  are  more  common  than  others)  the  data  matrices 
and  XG3)  are  usually  normalized;  we  can  encode  this  normalization  using  weight  matrices.  The  objective. 
Equation  9,  is  the  weighted  average  of  two  probabilistic  LSI  [19]  models  with  shared  latent  factors  h^.  Since 
each  pLSI  model  is  a  one-matrix  example  of  our  general  model,  the  two-matrix  version  can  be  placed  within 
our  framework. 

Matrix  co-clustering  techniques  have  a  stochastic  constraint:  if  an  entity  increases  its  membership  in 
one  cluster,  it  must  decrease  its  membership  in  others  clusters.  Examples  of  matrix  and  relational  co¬ 
clustering  include  pLSI,  pLSI-pHITS,  the  symmetric  block  models  of  Long  et.  al.  [23,  24,  25],  and  Bregman 
tensor  clustering  [5]  (which  can  handle  higher  arity  relations).  Matrix  analogues  of  factor  analysis  place  no 
stochastic  constraint  on  the  parameters.  Collective  matrix  factorization  has  been  presented  using  matrix 
factor  analyzers,  but  the  stochastic  constraint,  that  each  row  of  sums  to  I,  distributes  over  the  alternating 
projection  to  an  equality  constraint  on  each  update  of  Ul  .  This  additional  equality  constraint  can  be  folded 
into  the  Newton  step  using  a  Lagrange  multiplier,  yielding  an  unconstrained  optimization  (c./.,  ch.  10  [9]). 
Comparing  the  extension  of  collective  matrix  factorization  to  the  alternatives  above  is  a  topic  for  future 
work.  It  should  be  noted  that  our  choice  of  X  =  UV'^  is  not  the  only  one  for  matrix  factorization.  Long 
et.  al.  [23]  proposes  a  symmetric  block  model  X  «  CiACj ,  where  Ci  G  {0,  and  C2  G  {0,  l}"2xfc 

cluster  indicator  matrices,  and  A  G  contains  the  predicted  output  for  each  combination  of  row  and 

column  clusters.  Early  work  on  this  model  uses  a  spectral  relaxation  specific  to  squared  loss  [23],  while  later 
generalizations  to  regular  exponential  families  [25]  use  EM.  An  equivalent  formulation  in  terms  of  regular 
Bregman  divergences  [24]  uses  iterative  majorization  [22,  36]  as  the  inner  loop  of  alternating  projection.  An 
improvement  on  Bregman  co-clustering  accounts  for  systematic  biases,  block  effects,  in  the  matrix  [I]. 

The  three-factor  schema  ~  £2  ~  £3  also  includes  supervised  matrix  factorization.  In  this  problem, 
the  goal  is  to  classify  entities  of  type  £2'-  matrix  contains  class  labels  according  to  one  or  more  related 

concepts  (one  concept  per  row),  while  XC3)  lists  the  features  of  each  entity.  An  example  of  a  supervised 
matrix  factorization  algorithm  is  the  support  vector  decomposition  machine  [31]:  in  SVDMs,  the  features 
are  factored  under  squared  loss,  while  the  labels  are  factored  under  Hinge  loss.  A  similar  model 
was  proposed  by  Zhu  et  al.  [39],  using  a  once-differentiable  variant  of  the  Hinge  loss.  Another  example  is 
supervised  LSI  [37],  which  factors  both  the  data  and  label  matrices  under  squared  loss,  with  an  orthogonality 
constraint  on  the  shared  factors.  Principal  components  analysis,  which  factors  a  doubly  centered  matrix 
under  squared  loss,  has  also  been  extended  to  the  three- factor  schema  [38]. 

Another  interesting  type  of  schema  contains  multiple  parallel  relations  between  two  entity  types.  An 
example  of  this  sort  of  schema  is  max-margin  matrix  factorization  (MMMF)  [32].  In  MMMF,  the  goal  is 
to  predict  ordinal  values,  such  as  a  user’s  rating  of  movies  on  a  scale  of  {1, . .  .,i?}.  We  can  reduce  this 
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prediction  task  to  a  set  of  binary  threshold  problems,  namely,  predicting  r  >  1,  r  >  2, . . . ,  r  >  i?.  If  we 
use  a  Hinge  loss  for  each  of  these  binary  predictions  and  add  the  losses  together,  the  result  is  equivalent 
to  a  collective  matrix  factorization  where  £i  are  users,  £2  are  movies,  and  £i  £2  for  u  =  \  . . .  R  are  the 
binary  rating  prediction  tasks.  In  order  to  predict  different  values  for  the  R  different  relations,  we  need  to 
allow  the  latent  factors  and  to  contain  some  untied  columns,  ie.,  columns  which  are  not  shared 
among  relations.  For  example,  the  MMMF  authors  have  suggested  adding  a  bias  term  for  each  rating  level 
or  for  each  (user,  rating  level)  pair.  To  get  a  bias  for  each  (user,  rating  level)  pair,  we  can  append  R  untied 
columns  to  and  have  each  of  these  columns  multiply  a  fixed  column  of  ones  in  To  get  a  shared 
bias  for  each  rating  level,  we  can  do  the  same,  but  constrain  each  of  the  untied  columns  in  to  be  a 
multiple  of  the  all-ones  vector. 

7  Experiments 

7.1  Movie  Rating  Prediction 

Our  experiments  focus  on  two  tasks:  (i)  predicting  whether  a  user  rated  a  particular  movie:  israted;  and 
(ii)  predicting  the  value  of  a  rating  for  a  particular  movie:  rating.  User  ratings  are  sampled  from  the  Netflix 
Prize  data  [29]:  a  rating  can  be  viewed  as  a  relation  taking  on  five  ordinal  values  (1-5  stars),  i.e.,  Rating(user, 
movie).  We  augment  these  ratings  with  two  additional  sources  of  movie  information,  from  the  Internet  Movie 
Database  [20]:  genres  for  each  movie,  encoded  as  a  binary  relation,  i.e.,  HasGenre(movie,  genre);  and  a  list 
of  actors  in  each  movie,  encoded  as  a  binary  relation,  i.e.,  HasRole (actor,  movie).  In  schema  notation  £i 
corresponds  to  users,  £2  corresponds  to  movies,  £3  corresponds  to  genres,  and  £4  corresponds  to  actors. 
Ordinal  ratings  are  denoted  £i  ~i  £2',  for  the  israted  task  the  binarized  version  of  the  ratings  is  denoted 
£1  ~2  £2-  Genre  membership  is  denoted  £2  ~  £3.  The  role  relation  is  £2  £4. 

There  is  a  significant  difference  in  the  amount  of  data  for  the  two  tasks.  In  the  israted  problem  we  know 
whether  or  not  a  user  rated  a  movie  for  all  combinations  of  users  and  movies,  so  the  ratings  matrix  has  no 
missing  values.  In  the  rating  problem  we  observe  the  relation  only  when  a  user  rated  a  movie — unobserved 
combinations  of  users  and  movies  have  their  data  weight  set  to  zero. 

7.1.1  Model  and  Optimization  Parameters 

For  consistency,  we  control  many  of  the  model  and  optimization  parameters  across  the  experiments.  In  the 
israted  task  all  the  relations  are  binary,  so  we  use  a  logistic  model:  sigmoid  link  with  the  matching  log-loss. 
To  evaluate  test  error  we  use  mean  absolute  error  (MAE)  for  both  tasks,  which  is  the  average  zero-one  loss 
for  binary  predictions.  Since  the  data  for  israted  is  highly  imbalanced  in  favour  of  movies  not  being  rated, 
we  scale  the  weight  of  those  entries  down  by  the  fraction  of  observed  relations  where  the  relation  is  true. 
We  use  £2  regularization  throughout.  Unless  otherwise  stated  the  regularizers  are  all  G{U)  =  10^1|f7l||./2. 
In  Newton  steps,  we  use  an  Armijo  line  search,  rejecting  updates  with  step  length  smaller  than  77  =  2~'^. 
In  Newton  steps,  we  run  till  the  change  in  training  loss  falls  below  5%  of  the  objective.  Using  stochastic 
Newton,  we  run  for  a  fixed  number  of  iterations. 

7.2  Relations  Improve  Predictions 

Our  claim  regarding  relational  data  is  that  collective  factorization  yields  better  predictions  than  using  a 
single  matrix.  We  consider  the  israted  task  on  two  relatively  small  data  sets,  to  allow  for  repeated  trials. 
Since  this  task  involves  a  three  factor  model  there  is  a  single  mixing  factor,  a  in  Equation  3.  We  learn  a 
model  for  several  values  of  a,  starting  from  the  same  initial  random  parameters,  using  full  Newton  steps. 
The  performance  on  a  test  set,  entries  sampled  from  the  matrices  according  to  the  test  weights,  is  measured 
at  each  a.  Each  trial  is  repeated  ten  times  to  provide  1-standard  deviation  error  bars. 

Two  scenarios  are  considered.  First,  where  the  users  and  movies  were  sampled  uniformly  at  random; 
all  genres  that  occur  in  more  than  1%  of  the  movies  are  retained.  We  only  use  the  users’  ratings  on  the 


10 


(a)  Ratings  (b)  Genres 

Figure  1:  Test  errors  (MAE)  for  predicting  whether  a  movie  was  rated,  and  the  genre,  on  the  dense  rating 
example. 


sampled  movies.  Second,  where  we  only  sample  users  that  rated  at  most  40  movies,  which  greatly  reduces 
the  number  of  ratings  for  each  user  and  each  movie.  In  the  first  case,  the  median  number  of  ratings  per  user 
is  60  (the  mean,  127);  in  the  second  case,  the  median  number  of  ratings  per  user  is  9  (the  mean,  10).  In 
the  first  case,  the  median  number  of  ratings  per  movie  is  9  (the  mean,  21);  in  the  second  case,  the  median 
number  of  ratings  per  movie  is  2  (the  mean,  8).  In  the  first  case  we  have  ni  =  500  users  and  n2  =  3000 
movies  and  in  the  second  case  we  have  ni  =  750  users  and  n2  =  1000  movies.  We  use  a  fc  =  20  embedding 
dimension  for  both  matrices. 

The  dense  rating  scenario,  Figure  1,  shows  that  collective  matrix  factorization  improves  both  prediction 
tasks:  whether  a  user  rated  a  movie,  and  which  genres  a  movie  belongs  to.  When  a  =  1  the  model  uses  only 
rating  information;  when  a  =  0  it  uses  only  genre  information. 

In  the  sparse  rating  scenario.  Figure  2,  there  is  far  less  information  in  the  ratings  matrix.  Half  the 
movies  are  rated  by  only  one  or  two  users.  Because  there  is  so  little  information  between  users,  the  extra 
genre  information  is  more  valuable.  However,  since  few  users  rate  the  same  movies  there  is  no  significant 
improvement  in  genre  prediction. 

We  hypothesized  that  adding  in  the  roles  of  popular  actors,  in  addition  to  genres,  would  further  improve 
performance.  By  symmetry  the  update  equation  for  the  actor  factor  is  analogous  to  the  update  for  the  genre 
factor.  Since  there  are  over  100,000  actors  in  our  data,  most  of  which  appear  in  only  one  or  two  movies,  we 
selected  500  popular  actors  (those  that  appeared  in  more  than  ten  movies).  Under  a  wide  variety  of  settings 
for  the  mixing  parameters  there  was  no  statistically  significant  improvement  on  either 

the  israted  or  rating  task. 

7.3  Stochastic  Approximation 

Our  claim  regarding  stochastic  optimization  is  that  it  provides  an  efficient  alternative  to  Newton  updates 
in  the  alternating  projections  algorithm.  Since  our  interest  is  in  the  case  with  a  large  number  of  observed 
relations  we  use  the  israted  task  with  genres.  There  are  rii  =  10000  users,  n2  =  2000  movies,  and  ns  =  22 
of  the  most  common  genres  in  the  data  set.  The  mixing  coefficient  is  a  =  0.5.  We  set  the  embedding 
dimension  of  both  factorizations  to  /c  =  30. 

On  this  three  factor  problem  we  learn  a  collective  matrix  factorization  using  both  Newton  and  stochastic 
Newton  methods  with  batch  sizes  of  25,  75,  and  100  samples  per  row.  The  batch  size  is  larger  than  the 
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(a)  Ratings 


(b)  Genres 


Figure  2:  Test  errors  (MAE)  for  predicting  whether  a  movie  was  rated,  and  the  genre,  on  sparse  rating 
example. 


number  of  genres,  and  so  they  are  all  used.  Our  primary  concern  is  sampling  the  larger  user-movie  matrix. 
Using  Newton  steps  ten  cycles  of  alternating  projection  are  used;  using  stochastic  Newton  steps  thirty  cycles 
are  used.  After  each  cycle,  we  measure  the  training  loss  (log- loss)  and  the  test  error  (mean  absolute  error), 
which  are  plotted  against  the  CPU  time  required  to  reach  the  given  cycle  in  Figure  3.  This  experiment  was 
repeated  hve  times,  yielding  2-standard  deviation  error  bars. 

Using  only  a  small  fraction  of  the  data  we  achieve  results  comparable  to  full  Newton  after  hve  iterations. 
At  batch  size  100,  we  are  sampling  1%  of  the  users  and  5%  of  the  movies;  yet  its  performance  on  test  data 
is  the  same  as  a  full  Newton  step  given  8x  longer  to  run.  Diminishing  returns  with  respect  to  batch  size 
suggests  that  using  very  large  batches  is  unnecessary.  Even  if  the  batch  size  were  equal  to  maxjni,  712; 
stochastic  Newton  would  not  return  the  same  result  as  full  Newton  due  to  the  1/r  damping  factor  on  the 
sample  Hessian. 

It  should  be  noted  that  rating  is  a  computationally  simpler  problem.  On  a  three  factor  problem  with 
ni  =  100000  users,  n2  =  5000  movies,  and  =  21  genres,  with  over  1.3M  observed  ratings,  alternating 
projection  with  full  Newton  steps  runs  to  convergence  in  32  minutes  on  a  single  1.6  GHz  CPU.  We  use  a 
small  embedding  dimension,  k  =  20,  but  one  can  exploit  common  tricks  for  large  Hessians.  We  used  the 
Poisson  link  for  ratings,  and  the  logistic  for  genres;  convergence  is  typically  faster  under  the  identity  link. 

7.4  Comparison  to  pLSI-pHITS 

In  this  section  we  provide  an  example  where  the  additional  flexibility  of  collective  matrix  factorization  leads 
to  better  results;  and  another  where  a  co-clustering  model,  pLSl-pHlTS,  has  the  advantage. 

We  sample  two  instances  of  israted,  controlling  for  the  number  of  ratings  each  movie  has.  In  the  dense 
data  set,  the  median  number  of  ratings  per  movie  (user)  is  11  (76);  in  the  sparse  data  set,  the  median  number 
of  ratings  per  movie  (user)  is  2  (4).  In  both  cases  there  are  1000  randomly  selected  users,  and  4975  randomly 
selected  movies,  all  the  movies  in  the  dense  data  set. 

Since  pLSI-pHITS  is  a  co-clustering  method,  and  our  collective  matrix  factorization  model  is  a  link 
prediction  method,  we  choose  a  measure  that  favours  neither  inherently:  ranking.  We  induce  a  ranking  of 
movies  for  each  user,  measuring  the  quality  of  the  ranking  using  mean  average  precision  (MAP)  [18]:  queries 
correspond  to  user’s  requests  for  ratings,  “relevant”  items  are  the  movies  of  the  held-out  links,  we  use  only 
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Figure  3:  Behaviour  of  Newton  vs.  Stochastic  Newton  on  a  three-factor  model. 

the  top  200  movies  in  each  ranking^,  and  the  averaging  is  over  users.  Most  movies  are  unrated  by  any  given 
user,  and  so  relevance  is  available  only  for  a  fraction  of  the  items:  the  absolute  MAP  values  will  be  small, 
but  relative  differences  are  meaningful.  We  compare  four  different  models  for  generating  rankings  of  movies 
for  users: 

CMF-Identity:  Collective  matrix  factorization  using  identity  prediction  links,  fi{0)  =  /2(d)  =  S  and 
squared  loss.  Full  Newton  steps  are  used.  The  regularization  and  optimization  parameters  are  the  same  as 
those  described  in  Section  7.1.1,  except  that  the  smallest  step  length  is  77  =  2“®.  The  ranking  of  movies  for 
user  i  is  induced  by  f{Ui.V'^). 

CMF-Logistic:  Like  CMF-Identity,  except  that  the  matching  link  and  loss  correspond  to  a  Bernoulli 
distribution,  as  in  logistic  regression:  fi{0)  =  f2{0)  =  1/(1  -I-  exp“®). 

pLSI-pHITS:  Makes  a  multinomial  assumption  on  each  matrix,  which  is  somewhat  unnatural  for  the 
rating  task — a  rating  of  5  stars  does  not  mean  that  a  user  and  movie  participated  in  the  rating  relation 
five  times.  Hence  our  use  of  israted.  We  give  the  regularization  advantage  to  pLSI-pHITS.  The  amount  of 
regularization  /3  £  [0, 1]  is  chosen  at  each  iteration  using  tempered  EM.  The  smaller  (3  is,  the  stronger  the 
parameter  smoothing  towards  the  uniform  distribution.  We  are  also  more  careful  about  setting  (3  than  Cohn 
et.  al.  [13],  using  a  decay  rate  of  0.95  and  minimum  (3  of  0.7.  To  have  a  consistent  interpretation  of  iterations 
between  this  method  and  CMF,  we  use  tempering  to  choose  the  amount  of  regularization,  and  then  fit  the 
parameters  from  a  random  starting  point  with  the  best  choice  of  f3.  Movie  rankings  are  generated  using 
p{movie\user) . 

Pop:  A  baseline  method  that  ignores  the  genre  information.  It  generates  a  single  ranking  of  movies,  in  order 
of  how  frequently  they  are  rated,  for  all  users. 

In  each  case  the  models,  save  popularity  ranking,  have  embedding  dimension  fc  =  30  and  run  for  at  most 
10  iterations.  We  compare  on  a  variety  of  values  of  a,  but  we  make  no  claim  that  mixing  information 
improves  the  quality  of  rankings.  Since  a  is  a  free  parameter  we  want  to  confirm  the  relative  performance 
of  these  methods  at  several  values.  In  Figure  4,  collective  matrix  factorization  significantly  outperforms 
pLSI-pHITS  on  the  dense  data  set;  the  converse  is  true  on  the  sparse  data  set.  Ratings  do  not  benefit  from 
mixing  information  in  any  of  the  approaches,  on  either  data  set.  While  the  flexibility  of  collective  matrix 
factorization  has  its  advantages,  especially  computational  ones,  we  do  not  claim  unequivocal  superiority  over 

®The  relations  between  the  curves  in  Figure  4  are  the  same  if  the  rankings  are  not  truncated. 
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(a)  Dense 


(b)  Sparse 


Figure  4:  Ranking  movies  for  users  on  a  data  set  where  each  movie  has  many  ratings  (dense)  or  only  a 
handful  (sparse).  The  methods  are  described  in  Section  7.4.  Errors  bars  are  1-standard  deviation. 


relational  models  based  on  matrix  co-clustering. 


8  Contributions 

We  present  a  unified  view  of  matrix  factorization,  building  on  it  to  provide  collective  matrix  factorization  as 
a  model  of  pairwise  relational  data.  Experimental  evidence  suggests  that  mixing  information  from  multiple 
relations  leads  to  better  predictions  in  our  approach,  which  complements  the  same  observation  made  in 
relational  co-clustering  [23].  Under  the  common  assumption  of  a  decomposable,  twice  differentiable  loss,  we 
derive  a  full  Newton  step  in  an  alternating  projection  framework.  This  is  practical  on  relational  domains 
with  hundreds  of  thousands  of  entities  and  millions  of  observations.  We  present  a  novel  application  of 
stochastic  approximation  to  collective  matrix  factorization,  which  allows  one  handle  even  larger  matrices 
using  a  sampled  approximation  to  the  gradient  and  Hessian,  with  provable  convergence  and  a  fast  rate  of 
convergence  in  practice. 
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A  Derivation  of  the  3-entity-type  model 

The  3-entity-type  model  consists  of  two  relationship  matrices,  a  m  x  n  matrix  X  and  a  n  x  r  matrix  Y .  In  the  spirit 
of  generalized  linear  models  we  dehne  the  low  rank  representations  of  the  relationship  matrices  to  be  X  «  fi{UV"^) 
and  Y  «  f2{VZ'^)  where  fi  :  ^  R™'’""  and  /2  :  R"’'"  ^  R"’'"  are  the  prediction  links^  and  U  £ 

V  £  and  Z  £  are  the  parameters  of  the  model  for  k  £  Z++,  a  positive  integer.  We  further  dehne  functions 

G  :  R’""'*’  ^  R™-’"'',  H  :  R"''''  ^  R"'''=,  I  :  R’"’"''  ^  R’"'‘'=,  to  model  prior  knowledge  about  our  parameters,  e.g., 
regularizers.  We  additionally  require  the  convex  conjugate, 

G*{U)=  sup  [UoA-G{A)], 

U  GdoTn.(A) 

where  U  o  A  =  tr{U^A)  =  J2ij  UijAij  is  the  matrix  dot  product.  The  overall  loss  function  for  our  model  is 
(10)  L{U,  V,  Z\W,  W)  =  aLi{U,  V\W)  +  (1  -  a)L2{V,  Z\W) 


where  we  introduce  hxed  weight  matrices  for  the  observations,  W  £  and  W  £  R"^''.  The  individual  objectives 

on  the  reconstruction  of  X  and  Y  are,  respectively, 

(11)  Li{U,V\W)  =  (WoFi{UV^)-{WQX)oUV'^'J +G*{U)+H*{V) 

=  E  (^1  ■  t/. E)  +  ^  E  ^  E  ’ 

ij  rs  pq 


(12) 


L2{V,Z\W) 


(w  O  F2{VZ'^)  -{WqY)o  VZ^'J  +  H*{V)  +  I*{Z) 

E  Wij  (t'2(Y,.zJ)  -  Y,. .  Y.xj)  +  ^  E  +  I E 

ij  pq  rs 


where  Fi  and  F2  are  element-wise  functions  over  matrices,  i.e.,  L\  and  L2  are  each  decomposable.  The  scalars 
a,b,c>  0  control  the  strength  of  regularization  on  the  factors  (the  larger  the  value,  the  weaker  the  regularizer) .  The 
objective  L{U,V,  Z\W,W)  is  convex  in  any  one  of  its  arguments,  but  is  in  general  non-convex  in  all  its  arguments. 
As  such,  we  use  an  alternating  minimization  scheme  that  optimizes  one  factor  U,  V,  or  Z  at  a  time.  This  appendix 
describes  the  derivation  of  both  gradient  and  Newton  update  rules  for  U,  V,  and  Z.  For  completeness.  Section  A.l 
reviews  useful  definitions  from  matrix  calculus.  The  gradient  of  the  objective,  with  respect  to  each  argument  is 
derived  in  Section  A. 2.  Finally,  by  assuming  the  loss  is  decomposable,  we  derive  the  Newton  update  in  Section  A. 3, 
whose  additional  cost  over  its  gradient  analogue  is  essentially  a  factor  of  k  times  more  expensive. 


A.l  Matrix  Calculus 

For  the  sake  of  completeness  we  define  matrix  derivatives,  which  generalize  both  scalar  and  vector  derivatives.  Using 
this  definition  of  matrix  derivatives,  we  also  generalize  the  scalar  chain  and  product  rules  to  matrices.  The  discussion 
herein  is  based  on  Magnus  et.  al.  [26,  27]. 

“^Throughout  this  appendix  we  assume  that  prediction  links  /  and  the  corresponding  loss  defined  by  F  are  decomposable, 
j.e.,  F  :  R  — >  R  and  /  :  R  — >  R  are  applied  element-wise  when  their  arguments  are  matrices. 
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Let  M  be  an  n  X  g  matrix  of  variables,  where  m.j  denotes  the  j-th  column  of  M.  The  vec-operator  yields  a  ng  x  1 
matrix  that  stacks  the  columns  of  M: 


vecM  = 


/m.A 

m.2 


\m.J 


While  there  are  several  common  (and  incompatible)  definitions  of  matrix  derivatives,  the  derivative  of  a  n  x  1  vector 
/  with  respect  to  a  m  x  1  vector  x  is  almost  universally  defined  as 


Df{x) 


=  1L 

dx'^ 


/dh 

dxi 


dh  \ 
dxm  \ 


The  matrix  derivative  of  an  m  xp  matrix  function  ip  of  an  n  x  g  matrix  of  variables  M  contains  mnpq  partial  derivatives, 
and  the  matrix  derivative  arranges  these  partial  derivatives  into  a  matrix.  We  define  the  matrix  derivative  by  coercing 
matrices  into  vectors,  and  using  the  above  definition  of  the  vector  derivative; 


Dip{M) 


dvec(p{M) 

9(vecM)^ 


a[<p(M)iii 

dmii 


d[ip(M)]jnp 


a[<p(M)iii  \ 

\ 


8[v(M)\ 

mp  j 
drrinq  / 


which  is  an  mp  x  ng  matrix  of  partial  derivatives.  This  definition  encompasses  vector  and  scalar  derivatives  as  special 
cases.  The  advantages  of  this  formulation  include  (i)  unambiguous  definitions  for  the  product  and  chain  rules,  and 
(ii)  we  can  easily  convert  Dip{M)  to  the  more  common  definition  of  the  matrix  derivative, 


dip{M) 

dM 


/8<p(M) 

8mii 

8<f>(M) 
\  8m.nl 


dtp(M)  \ 

8mig  \ 


dtp(M) 

dmnq  / 


via  the  first  identification  theorem  [27,  ch.  9].  We  additionally  require  the  Kronecker  product,  and  the  matrix  chain 
[27,  pg.  121]  and  matrix  product  [26]  rules: 


Definition  3  (Kronecker  Product).  Given  a  m  x  p  matrix  Q  and  a  n  x  q  matrix  M  the  Kronecker  product,  Q  ®  M , 
is  a  mn  x  pq  matrix: 


Q®M  = 


/QiiM 

WmiM 


QipM' 

QmpM, 


Definition  4  (Matrix  Chain  Rule).  Given  functions  ip  :  (pi  :  R^^""  ^  and  p2  '■  R"’^®  ^ 

the  derivative  of  p{M)  =  pi{Y),  where  Y  =  (p2(M),  is 

Dp{M)  =  Dpi(Y)  X  Dp2{M). 

where  Ax  B  (or  equivalently,  AB )  is  the  matrix  product. 

Definition  5  (Matrix  Product  Rule).  Given  an  mxp  matrix  pi{M),  a  pxr  matrix  (p2(M),  and  a  lx  s  matrix  M , 
the  derivative  of  pi{M)  x  <p2(M)  is 

D{pi  X  (P2)(M)  =  ((p2(M)^  0  Im)  X  Dp-i{M)  +  {Ir  ®  pi{M))  X  Dp2{M), 
where  A®  B  is  the  Kronecker  product. 
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A. 2  Computing  the  Gradient 

To  compute  the  derivative  of  Equation  10  with  respect  to  U,  V ,  and  Z  we  require  the  following  three  lemmas: 

Lemma  1.  For  any  differentiable  funetion  E  :  R  ^  R 

d(WoF(UV'^))  /  ,  d(W  oF(UV'^))  ,  , 

^ - ’-^=[WQf{UV^))v,  - ll=(wQf{UV^))  U. 

where  f  =  \7F. 

Proof.  We  only  derive  the  result  for  (p{U)  =  F{UV'^),  the  proof  is  similar  for  the  other  case.  <p{U)  can  be  expressed 
as  the  composition  of  functions:  ^piU)  =  F{Y),  Y  =  (p2{U),  p>2{U)  =  UV^ .  We  note  that 

dvecW  o  F{UV^) 


(13) 


DF{Y)  = 


d{YecUVT)T 


SYijWijF{YijF 


V  9^11  ■  ■  ■  dYmn  j 

=  )  ...  WmnfiU^.Vj.)) 

=  (vecW  Q  f{UV'^)Y  . 

Furthermore,  using  the  matrix  product  rule 


(14) 


_  _  .  dvecU  i9vecE^ 

Dip2{U)  =  {V  ®  7m)  X  - — —  +  {In  ®U)  X 


9(vec  7/)^ 


d{YBcU)T 


=  {V®Im). 


□ 


Combining  equations  13  and  14  using  the  matrix  chain  rule  and  the  identity  (vecTl)^  x  {B  ®  I)  =  (vecAB)^  yields 
Dip{U)  =  DF{Y)  X  Dip2{U)  =  (vec  {W  ©  f{UV^))  V)^ .  The  first  identification  theorem  [27,  ch.  9],  which  formally 

states  the  rules  for  rearranging  entries  of  Dtp{U)  to  get  ))  ^  completes  the  proof. 

Lemma  2.  For  fixed  matrices  W  and  X  and  variable  matrices  U  and  V 

"((^0^)°^^-^  =  (W0X)E,  "((^QS°^^^)=(W0X)^7/. 


dU 


dV 


Proof.  We  derive  the  result  for  d{X  o  UV'^)ldV ,  the  other  derivation  is  similar.  To  avoid  a  long  digression  into 
matrix  differentials,  we  prove  the  result  by  element-wise  differentiation.  Noting  that 


X  o  UV'^  =  X! 


'^jkVik  5 


we  compute  the  derivative  with  respect  to  Vpq: 

d{{WQX)oUV^) 

OVnn 


—  WjiXji—  UjkVik 


^^WjpXjpUjq  —  i{W  Q  X)  u'] 


Where  the  final  step  follows  from  the  fact  that  the  derivative  is  zero  unless  (*,i)  =  {p,q).  Since  the  result  holds  for 
all  p  G  {1, . . .  ,n}  and  q  G  {1,  ...,k}  it  follows  that  d{X  o  UV’^)jdV  =  X^U.  □ 


Lemma  3.  For  the  £2  regularizers  where  a,b,c  >  0  controls  the  strength  of  regularizers  (larger  values 
regularization): 


weaker 


a\\U\\] 


G{U)  = 

the  derivatives  for  the  convex  conjugates  are 

dG^U)  U 

dU  a 


H{V)  = 


dH*{V)  V 
dV  b 


cll^ll 


I{Z)  = 


dr{z)  z  ^ 

dZ  c 
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Proof.  The  result  is  easily  proven  by  finding  the  convex  conjugate  and  differentiating  it. 


□ 


Combining  Lemmas  1,2,  and  3,  and  denoting  the  Hadamard  (element-wise)  product  of  matrices  Q  ©  M,  we  have 
that  for  fi  =  VFi  and  /2  =  VF2 

(15)  Z)  ^  ^  ^  [muv^)  -x))v  +  A, 

(16)  ®  {MUV^)  -X^Yu  +  {l-a)(w@  [f2{VZ^)  -Y^Z  +  B, 

(17)  =  (1  -  «)  Q  {mvz^)  -y)Yv  +  c. 

Setting  the  gradient  equal  to  zero  yields  update  equations  either  for  A,  B,  C  or  for  U ,  V,  Z.  An  advantage  of  using 
a  gradient  update  is  that  we  can  relax  the  requirement  that  the  links  are  differentiable,  replacing  gradients  with 
subgradients  in  the  prequel. 


A. 3  Computing  the  Newton  Update 

One  may  be  satisfied  with  a  gradient  step.  However,  the  assumption  of  a  decomposable  loss  means  that  most  second 
derivatives  are  set  to  zero,  and  a  Newton  update  can  be  done  efficiently,  reducing  to  row-wise  optimization  of  U,  V, 
and  Z.  For  the  subclass  of  models  where  Equations  15-17  are  differentiable  and  the  loss  is  decomposable  define, 

q{Ui.)  =  a  (Wi.  Q  (/i (Ui. V'^)-Xi.Yv  +  Ai., 

q{Vi.)  =  a  (Wi.  0  (/i(t/F)^)  -  Xi-Y^U  +  il-a)  (Wi.  0  (^h{Vi.Z'^)  -  Ei.))  Z  -b  Hi., 
q{Zi.)  =  (1  -  a)  (W.i  ©  [h{VZl)  -Y.^Y  y  +  Ci.. 

Any  local  optimum  of  the  loss  corresponds  to  a  simultaneous  root  of  the  functions  {q{Ui.)}YL\,  {'/(Hi  )})©!,  and 
{q{Zi.)Yi=i.  Using  a  Newton  step,  the  update  for  Ui.  is 

(18)  Ur'"  =  U.  -  77  [q{Ui.)]  [q'iUi.)]-^  , 

where  q  £  (0, 1]  is  the  step  length,  chosen  using  line  search  with  the  Armijo  criterion  [30,  ch.  3].  The  Newton  steps 
for  Vi.  and  Zi.  are  analogous.  To  describe  the  Hessians  of  the  loss,  q' ,  we  introduce  the  following  notation  for  the 
Hessians  of  G* ,  H*  and  /*: 

Gi  =  diag(V^G*([/i.))  =  diag(a“^l), 

H,  =  diag(V^H“(Ui.))  =  diag(fe-A), 

h  =  diag(V^7*(Zi.))  =  diag(c“U). 

For  conciseness  we  also  introduce  the  following  terms; 

Z)i,i  =  diag(lUi.  ©  fiiUi.V'^)),  D2,i  =  dmgiW.i  ©  f[{UV^)), 

D3,i  =  diag(lUi.  ©  MV^Z)),  D4,i  =  dmg{W.i  ©  fYVZl)). 


This  allows  us  to  describe  the  Hessians  of  Equation  10  with  respect  to  each  row  of  the  parameter  matrices. 
Lemma  4.  The  Hessians  of  Equation  10  with  respect  to  Ui.,  Vi.,  and  Zi.  are 


q{Ui.) 


q{Zi.) 


9'(U) 


dq{Ui.) 

dUi. 

dq{Zi.) 

dZi. 

dq{Vi.) 

dVi. 


=  aV'^  DgiV  +  Gi, 

=  (1  -a)U^D4.iU  +  7i, 

=  aU^D2,iU  -b  (1  -  a)Z'^D3,iZ  +  Hi. 
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Proof.  We  prove  the  result  for  q'(Ui.),  noting  that  the  other  derivations  are  similar.  Since  q{-)  and  its  argument  Ui. 
are  both  vectors,  Dq{Ui.)  is  identical  to  dq{Ui.)/dUi..  Ignoring  the  terms  that  do  not  vary  with  Ui., 


Dq{Ui.)  =  D  [a(m.  0  f{Ui.V'^))V  +  Ai^ 

=  aD  [(Wi.  0  fiUi.V'^))V^  +  DAi. 


=  a  <  {V'^  0  1)  X 


dveciWi.  Q  fiUi.V'^)) 


+  0/(f/i.V^)))  X 


diyecUi-Y 

=  a  0  1)  X  Di,iV  +  (4  0  {Wi.  0  fiUi.V'^)))  X  0}  +  Gi 

=  aV'^Di,iV  +  Gi. 

The  introduction  of  Di^iV  above  follows  from  observing  that 

'dvec{Wi.ef{Ui.v'^))\  _  d {Wip Q fiUi.v;;:)) 


dvecV  \  _^dvecVG*{Ui.) 


d{YecUi.)T  j  d(vecUi.)'r 


d  vec  Ui- 


dUic 


=  Wipf'iUi.V;^)Vp, 


which  is  simply  the  scaling  of  the  row  of  V  by  ujip.  □ 

An  alternate  form  for  the  updates,  known  as  the  adjusted  dependent  variates  form,  can  be  derived  by  plugging 
the  gradient  q{Ui.)  and  the  Hessian  q'{Ui.)  into  Equation  18; 

(19)  Ur'~q'{Ui.)  =  Ui.iaV^Di,iV  +  Gi)  Tap  (Wi.  0  (w.  -  /(t/i.H^)))  H  -  rjAi. 

=  aUi.V^Di,iV  +  ar,  (Wi.  0  (Xi.  -  /(f/i.l/^)))  DflDi,iV  +  Ui.Gi  -  rjAr 
=  a  (Ui.V'^  +  p  (Wi.  0  (Xi.  -  /(t/i.H^))  Di,iV  +  Ui.Gi  -  vAi. 

Likewise  for  Zi., 

(20)  Zr^q'iZi.)  =  (1  -  a)  (^Zi.V^  +  r)  (w.t  0  (v.i  -  f{VZl))y  Di,iV  +  Zi.U  -  rjGi. 

The  derivation  of  the  update  for  Vi.  is  similar,  since  L{U,V,  Z\W,W)  is  a  linear  combination  and  the  differential 
operator  is  linear: 

(21)  Vr^q'iVi.)  =  a  I  (Vi.U^  +  V  {W.i  0  (Xi  -  f{UK^))Y  Dj.if/j  + 

(1  -  a)  {  (Vi.Z'^  +  rj  (Wi.  0  (Vi.  -  /(Xi.Z^)))  43- J)  43, iZ}  + 

Vi. Hi  -  r]Bi. 

While  4  £  may  not  be  invertible,  i.e.,  a  diagonal  entry  is  zero  when  the  corresponding  weight  is  zero, 

the  form  of  the  update  equations  shows  that  this  does  not  matter.  If  a  diagonal  entry  in  4  is  zero,  replacing  its 
corresponding  entry  in  4“^  by  any  nonzero  value  does  not  change  the  result  of  Equations  19,  20,  and  21,  as  the  zero 
weight  cancels  it  out. 

B  Generalized  and  Standard  Bregman  Divergence 

We  prove  that  the  Generalized  Bregman  divergence  between  0,  a;  £  R, 

]D)F(6'||a:)  =  F{e)+F*{x)  -  Ox, 
is  equivalent  to  the  standard  definition  [10,  11], 

Df>{x  II  f{e))  =  F*{x)  -  4*(/(0))  -  V4*(/(0))(a;  -  /(0)), 


20 


when  F  is  convex,  twice-differentiable,  closed  and  /  =  V-F  is  invertible.  By  the  definition  of  the  convex  conjugate, 


0^(6^  li  a:)  =  F{9)  +  sup  {cj>x  —  F{(j))}  —  Ox, 
<P 


where  at  the  supremum  0  =  f  ^(x).  Therefore, 

Df(6I  II  a;)  =  F{e)  +  f~^{x)  ■  x  -  F{f-\x))  -  Ox 
=  F{e)  -  F{f-\x))  -  x{e  -  f-\x)) 

=  F{e)  -  F{r\x))  -  VF{f-\x)){e  -  r\x)) 

^Dp{e\\f-\x)). 

For  the  above  choice  of  F,  DF{0\\f~^{x))  =  DF*{x\\f{9)).  When  F  is  the  log-partition  function  of  a  regular 
exponential  family,  it  satisfies  the  properties  required  for  the  equivalence  between  generalized  and  standard  Bregman 
divergences. 
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